Abstract
Motivation: Approximate membership query (AMQ) structures such as Cuckoo filters or Bloom filters are widely used for representing large sets of elements. Their small memory footprint explains their success, mainly because they are the only way to scale to hundreds of billions or trillions of elements. However, they inherently suffer from unavoidable false-positive calls that bias the downstream analyses of methods using these data structures.
Results: In this work, we propose a simple strategy, together with its implementation, for reducing the false-positive rate of any AMQ data structure indexing \(k\)-mers (words of length \(k\)). The method, called findere, speeds up queries by a factor of two and decreases the false-positive rate by two orders of magnitude. This is achieved on the fly at query time, without modifying the original indexing data structure, without generating false-negative calls, and with no memory overhead.
This method yields so-called “construction false positives”, but their number is negligible when the method is used within classical parameter ranges. As simple as it is effective, the method reduces either the false-positive rate or the space required to represent a set at a user-defined false-positive rate.
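The abstract does not spell out the mechanism, so the following is only a minimal sketch under a stated assumption: that the AMQ stores shorter words (\(s\)-mers, with \(s \le k\)) and that a \(k\)-mer is reported present only if all of its \(k-s+1\) overlapping \(s\)-mers are found. The names `TinyBloomFilter`, `index_smers` and `query_kmers` are hypothetical and are not taken from the findere code base. The sketch illustrates why such a query-time check cannot create false negatives, and how a “construction false positive” can arise when the \(s\)-mers of an absent \(k\)-mer are all present through different indexed words.

```python
import hashlib


class TinyBloomFilter:
    """Minimal Bloom filter, used only as a stand-in for 'any AMQ'."""

    def __init__(self, n_bits, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # One salted BLAKE2b hash per hash function.
        for i in range(self.n_hashes):
            digest = hashlib.blake2b(item.encode(), digest_size=8,
                                     salt=i.to_bytes(8, "little")).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1
                   for p in self._positions(item))


def index_smers(sequence, s, amq):
    """Insert every s-mer of `sequence` into `amq` (assumed indexing step)."""
    for i in range(len(sequence) - s + 1):
        amq.add(sequence[i:i + s])


def query_kmers(query, k, s, amq):
    """Return one boolean per k-mer of `query`.

    A k-mer is reported present only if all of its k - s + 1 overlapping
    s-mers are reported present by `amq` (assumed query rule).
    """
    if not 0 < s <= k:
        raise ValueError("expected 0 < s <= k")
    # Each s-mer is looked up once and shared by the k-mers that overlap it.
    hits = [query[i:i + s] in amq for i in range(len(query) - s + 1)]
    window = k - s + 1
    return [all(hits[j:j + window]) for j in range(len(query) - k + 1)]


# An exact set isolates construction false positives from AMQ false positives.
exact = set()
index_smers("ACGA", 3, exact)
index_smers("CGTT", 3, exact)
# "ACGT" was never indexed, yet its 3-mers ACG and CGT both were.
print(query_kmers("ACGT", 4, 3, exact))  # [True]
```

Under this assumption, an ordinary AMQ false positive on a single \(s\)-mer is not enough to report a \(k\)-mer: every one of its \(s\)-mers has to be reported present. Any object offering `add` and `in` can play the role of the AMQ here, which mirrors the claim that the indexing structure itself is left unmodified.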
Availability: https://github.com/lrobidou/findere.
Acknowledgements
This work used HPC resources from the GenOuest bioinformatics core facility (https://www.genouest.org). The work was funded by ANR SeqDigger (ANR-19-CE45-0008).
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Robidou, L., Peterlongo, P. (2021). findere: Fast and Precise Approximate Membership Query. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_13
DOI: https://doi.org/10.1007/978-3-030-86692-1_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86691-4
Online ISBN: 978-3-030-86692-1
eBook Packages: Computer Science, Computer Science (R0)